Module Project - Unsupervised Learning¶

Part 1 DOMAIN: Automobile¶


• CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes
• DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon
• Attribute Information:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

PROJECT OBJECTIVE: The goal is to cluster the data and treat each cluster as an individual dataset to train regression models to predict ‘mpg’

1. Import and warehouse data: [ Score: 3 points ]
• Import all the given datasets and explore shape and size.
• Merge all datasets into one and explore final shape and size.
• Export the final dataset and store it on the local machine in .csv, .xlsx and .json format for future use.
• Import the data from above steps into python.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# Import Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#import pandas_profiling

import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

from sklearn.preprocessing import StandardScaler
In [ ]:
# Import First dataset and check shape
df_CarName = pd.read_csv("/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Car name.csv")
df_CarName.sample(5)
df_CarName.shape
Out[ ]:
car_name
95 buick electra 225 custom
283 amc concord dl 6
307 oldsmobile omega brougham
71 mazda rx2 coupe
334 mazda rx-7 gs
Out[ ]:
(398, 1)
In [ ]:
# Import second dataset and check shape
df_CarAtr = pd.read_json("/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Car-Attributes.json")
df_CarAtr.sample(5)
df_CarAtr.shape
Out[ ]:
mpg cyl disp hp wt acc yr origin
76 18.0 4 121.0 112 2933 14.5 72 2
357 32.9 4 119.0 100 2615 14.8 81 3
74 13.0 8 302.0 140 4294 16.0 72 1
312 37.2 4 86.0 65 2019 16.4 80 3
214 13.0 8 302.0 130 3870 15.0 76 1
Out[ ]:
(398, 8)
In [ ]:
# Combine two Dataframes
frames = [df_CarName,df_CarAtr]
df_mpgRaw = pd.concat(frames,axis=1,verify_integrity=True)
df_mpgRaw.sample(5)
df_mpgRaw.shape
Out[ ]:
car_name mpg cyl disp hp wt acc yr origin
341 chevrolet citation 23.5 6 173.0 110 2725 12.6 81 1
242 bmw 320i 21.5 4 121.0 110 2600 12.8 77 2
122 saab 99le 24.0 4 121.0 110 2660 14.0 73 2
26 chevy c20 10.0 8 307.0 200 4376 15.0 70 1
299 peugeot 504 27.2 4 141.0 71 3190 24.8 79 2
Out[ ]:
(398, 9)
In [ ]:
# Save Raw datasets in Excel, CSV and Json for future use
df_mpgRaw.to_csv('Cardata.csv',index=False)
#df_mpgRaw.to_excel('Cardata.xlsx', index=False)  # requires the openpyxl engine
df_mpgRaw.to_json('Cardata.json')
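The Excel export above is commented out because it needs an engine such as openpyxl; as a sanity check on the other two formats, a minimal round-trip sketch on an illustrative in-memory frame (not the notebook's actual file paths):

```python
import pandas as pd

# Hypothetical miniature of df_mpgRaw, used only to illustrate the round trip
df = pd.DataFrame({"car_name": ["saab 99le", "bmw 320i"],
                   "mpg": [24.0, 21.5]})

df.to_csv("Cardata_demo.csv", index=False)
df.to_json("Cardata_demo.json")

back_csv = pd.read_csv("Cardata_demo.csv")
back_json = pd.read_json("Cardata_demo.json")

# Both formats should reproduce the original values
print(back_csv.equals(df), back_json.reset_index(drop=True).equals(df))
```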
In [ ]:
# Import data from files
mpgRaw_csv = pd.read_csv("Cardata.csv")
#mpgRaw_excel = pd.read_excel("Cardata.xlsx")
mpgRaw_json = pd.read_json("Cardata.json")
In [ ]:
# Check info of data
mpgRaw_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   car_name  398 non-null    object 
 1   mpg       398 non-null    float64
 2   cyl       398 non-null    int64  
 3   disp      398 non-null    float64
 4   hp        398 non-null    object 
 5   wt        398 non-null    int64  
 6   acc       398 non-null    float64
 7   yr        398 non-null    int64  
 8   origin    398 non-null    int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB
In [ ]:
# Checking standard null values in data
mpgRaw_csv.isnull().sum()
Out[ ]:
car_name    0
mpg         0
cyl         0
disp        0
hp          0
wt          0
acc         0
yr          0
origin      0
dtype: int64
In [ ]:
print(mpgRaw_csv.iloc[330])
car_name    renault lecar deluxe
mpg                         40.9
cyl                            4
disp                        85.0
hp                             ?
wt                          1835
acc                         17.3
yr                            80
origin                         2
Name: 330, dtype: object
In [ ]:
# Change hp dtype to float and create the final dataframe
CarForMpg = mpgRaw_csv.copy()
CarForMpg.shape
CarForMpg['hp'] = pd.to_numeric(CarForMpg['hp'], errors="coerce")
# Checking null values
CarForMpg["hp"][CarForMpg["hp"].isnull()]
Out[ ]:
(398, 9)
Out[ ]:
32    NaN
126   NaN
330   NaN
336   NaN
354   NaN
374   NaN
Name: hp, dtype: float64
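For reference, `errors="coerce"` is what turns the '?' placeholders seen above into NaN; a minimal illustration:

```python
import pandas as pd

# '?' is the missing-value marker used in the raw hp column
s = pd.Series(["130", "93.5", "?"])
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.isna().sum())  # only the '?' entry becomes NaN
```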
In [ ]:
print(CarForMpg["hp"].median())
93.5
In [ ]:
# Replace null values with the column medians
# (numeric_only skips the string car_name column, which would otherwise raise in newer pandas)
CarForMpg.fillna(CarForMpg.median(numeric_only=True), inplace=True)

Exploratory Data Analysis¶

In [ ]:
# Describe the numerical data
CarForMpg.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
mpg 398.0 23.514573 7.815984 9.0 17.500 23.0 29.000 46.6
cyl 398.0 5.454774 1.701004 3.0 4.000 4.0 8.000 8.0
disp 398.0 193.425879 104.269838 68.0 104.250 148.5 262.000 455.0
hp 398.0 104.304020 38.222625 46.0 76.000 93.5 125.000 230.0
wt 398.0 2970.424623 846.841774 1613.0 2223.750 2803.5 3608.000 5140.0
acc 398.0 15.568090 2.757689 8.0 13.825 15.5 17.175 24.8
yr 398.0 76.010050 3.697627 70.0 73.000 76.0 79.000 82.0
origin 398.0 1.572864 0.802055 1.0 1.000 1.0 2.000 3.0
In [ ]:
CarForMpg.dtypes
Out[ ]:
car_name     object
mpg         float64
cyl           int64
disp        float64
hp          float64
wt            int64
acc         float64
yr            int64
origin        int64
dtype: object
In [ ]:
sns.pairplot(CarForMpg);
In [ ]:
# Scaling the data before K Means
from scipy.stats import zscore
CarForMpg_model = CarForMpg.drop("car_name", axis=1).apply(zscore)
CarForMpg_model.head()
Out[ ]:
mpg cyl disp hp wt acc yr origin
0 -0.706439 1.498191 1.090604 0.673118 0.630870 -1.295498 -1.627426 -0.715145
1 -1.090751 1.498191 1.503514 1.589958 0.854333 -1.477038 -1.627426 -0.715145
2 -0.706439 1.498191 1.196232 1.197027 0.550470 -1.658577 -1.627426 -0.715145
3 -0.962647 1.498191 1.061796 1.197027 0.546923 -1.295498 -1.627426 -0.715145
4 -0.834543 1.498191 1.042591 0.935072 0.565841 -1.840117 -1.627426 -0.715145
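scipy's `zscore` standardizes with the population standard deviation (ddof=0), which matches sklearn's `StandardScaler`; a quick equivalence check on toy data (a sketch, not part of the original notebook):

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

z1 = zscore(X, axis=0)                  # population std (ddof=0)
z2 = StandardScaler().fit_transform(X)  # also uses ddof=0

print(np.allclose(z1, z2))
```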
In [ ]:
CarForMpg_model["hp"].describe()
Out[ ]:
count    3.980000e+02
mean    -7.141133e-17
std      1.001259e+00
min     -1.527300e+00
25%     -7.414364e-01
50%     -2.830161e-01
75%      5.421404e-01
max      3.292662e+00
Name: hp, dtype: float64
In [ ]:
# Finding the optimal number of clusters
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
clusters = range(1, 20)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(CarForMpg_model)
    prediction = model.predict(CarForMpg_model)
    meanDistortions.append(
        sum(np.min(cdist(CarForMpg_model, model.cluster_centers_, 'euclidean'), axis=1))
        / CarForMpg_model.shape[0])

plt.plot(clusters, meanDistortions, 'bx-');
plt.xlabel('k');
plt.ylabel('Average distortion');
plt.title('Selecting k with the Elbow Method');

Observation:¶

A clear slope change (elbow) can be seen at a cluster size of 5 (k = 5); that value is selected for further study.
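As a cross-check on an elbow reading, silhouette scores can be computed over candidate k values; a self-contained sketch on synthetic blobs (the dataset, centre count, and random_state here are illustrative, not the project data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known cluster structure
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=1)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score corroborates (or questions) the elbow choice
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```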

In [ ]:
# Fit the final K-Means model with K = 5
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
final_model=KMeans(n_clusters=5)
final_model.fit(CarForMpg_model)
prediction=final_model.predict(CarForMpg_model)

#Append the prediction
CarForMpg["GROUP"] = prediction;
CarForMpg_model["GROUP"] = prediction;
print("Groups Assigned : \n");
CarForMpg.sample(5)
Out[ ]:
KMeans(n_clusters=5)
Groups Assigned : 

Out[ ]:
car_name mpg cyl disp hp wt acc yr origin GROUP
282 ford fairmont 4 22.3 4 140.0 88.0 2890 17.3 79 1 1
251 mercury monarch ghia 20.2 8 302.0 139.0 3570 12.8 78 1 0
359 peugeot 505s turbo diesel 28.1 4 141.0 80.0 3230 20.4 81 2 1
260 dodge aspen 18.6 6 225.0 110.0 3620 18.7 78 1 3
235 toyota corolla liftback 26.0 4 97.0 75.0 2265 18.2 77 3 2
In [ ]:
CarForMpgClust = CarForMpg.groupby(['GROUP'])
CarForMpgClust.mean()
Out[ ]:
mpg cyl disp hp wt acc yr origin
GROUP
0 14.429787 8.000000 350.042553 162.393617 4157.978723 12.576596 73.468085 1.000000
1 28.791045 4.194030 132.567164 82.865672 2563.805970 16.549254 79.671642 1.074627
2 34.137500 4.083333 99.527778 72.875000 2155.819444 16.734722 79.416667 2.763889
3 19.104938 6.222222 233.444444 101.882716 3298.580247 16.632099 75.703704 1.037037
4 24.619048 4.047619 108.601190 85.672619 2347.619048 16.107143 73.309524 2.107143
In [ ]:
CarForMpg_model.boxplot(by='GROUP', layout = (2,4),figsize=(15,10));

Using Hierarchical Clustering¶

In [ ]:
#### generate the linkage matrix
from scipy.cluster.hierarchy import dendrogram, linkage
CarForMpg_model = CarForMpg_model.drop("GROUP",axis=1)
Z = linkage(CarForMpg_model, 'ward', metric='euclidean')
Z.shape
Out[ ]:
(397, 4)
In [ ]:
# Plot Dendrogram
plt.figure(figsize=(25, 10));
dendrogram(Z);
plt.show();

Recreate the dendrogram for the last 10 merged clusters¶

In [ ]:
# Hint: Use truncate_mode='lastp' attribute in dendrogram function to arrive at dendrogram
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=10,  # show only the last p merged clusters
);
plt.show();
In [ ]:
# Cut the dendrogram at a maximum distance and form flat clusters
max_d = 5
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Out[ ]:
array([20, 20, 20, 20, 20, 19, 19, 19, 19, 20, 20, 20, 20, 19,  3, 15, 15,
       15,  3,  2,  2,  4,  2,  4, 15, 22, 22, 22, 22,  3,  1,  3,  2, 15,
       15, 15, 15, 15, 22, 22, 21, 21, 22, 22, 22, 15,  2, 15, 15,  1,  4,
        2,  4, 13, 13,  2,  2,  3,  1,  2,  2,  1, 22, 22, 21, 21, 20, 19,
       22, 22, 22,  3, 21, 21, 21, 21,  6,  2,  2,  2,  1,  3,  3,  1,  3,
       22, 20, 21, 21, 21, 19, 22, 22, 21, 19, 19, 20, 15, 16, 15, 15, 15,
        2, 22, 22, 22, 22, 15,  3,  2,  3,  3,  2, 15,  4, 21, 19,  2,  4,
        6,  6, 20,  6,  6, 20, 15, 15, 15, 16, 13,  1, 13,  1, 16, 16, 16,
       21, 21, 21, 21, 21,  4,  4,  4, 13, 13,  1,  4,  4,  3,  3,  4, 16,
       16, 16, 16, 22, 21, 21, 21, 16, 16, 16, 16, 17, 18, 18, 13,  1, 15,
        1,  5,  4,  3, 15, 10, 16,  6,  6,  6,  6, 13,  4,  4,  1,  1,  4,
       18, 18, 18, 18, 17, 17, 17, 17,  2,  2, 10, 13, 17, 16, 16, 16, 10,
       13, 13,  1,  6, 18, 12,  6,  6, 22, 18, 18, 18, 13,  9, 10,  1, 13,
       18, 16, 18, 18, 16, 17, 17, 17, 22, 22, 22, 18, 10,  1, 13,  1,  9,
        9, 13, 10,  6,  6,  5, 11,  9, 13, 14, 14, 18, 18, 18, 17, 17, 17,
        1, 17, 17, 17, 17, 17, 17, 18, 18, 18, 18,  9,  5,  5,  9,  5,  1,
        1,  5,  6,  6,  6,  6, 10, 13, 17, 17,  1, 17, 17, 18, 18, 18, 18,
       18, 18, 18, 18, 10, 14,  9,  7, 12, 17, 12, 17,  9,  9, 13, 10,  7,
        8,  8,  8, 10, 14,  9, 14,  7,  7,  7, 17, 10, 14, 14, 14, 14, 14,
        7, 14, 11, 11, 12, 12, 14, 10, 14, 10,  5,  5, 10,  7, 14,  7,  7,
        7,  8,  8, 14,  9, 14, 14, 14, 14, 14,  9,  9,  7, 10, 10, 14, 14,
       14, 14, 12, 12,  5,  5, 17, 17, 17, 17,  7,  7,  7,  7,  7,  7,  7,
        7, 10, 14, 14,  9,  9, 14, 14, 14, 14, 14, 14, 17,  7,  7, 17, 14,
        8,  7,  7, 11,  8,  7,  7], dtype=int32)
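An alternative to cutting at a distance threshold is to request a fixed number of flat clusters with criterion='maxclust'; a minimal sketch on toy data (the two-blob array here is illustrative, not the car data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

Z = linkage(X, method='ward', metric='euclidean')

# Ask for exactly 2 flat clusters instead of choosing a distance threshold
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))
```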
In [ ]:
# Visualizing clusters on synthetic 2-D data (illustration only: the real data
# has 8 dimensions, so 398 random 2-D points are generated just to colour
# them by the hierarchical cluster labels)
plt.figure(figsize=(10, 8))
np.random.seed(101)  # for repeatability of this dataset
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[150,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[98,])
c = np.random.multivariate_normal([10, 20], [[3, 1], [1, 4]], size=[150,])
X = np.concatenate((a, b, c), axis=0)
print(X.shape)  # 398 samples with 2 dimensions
plt.scatter(X[:,0], X[:,1], c=clusters)  # plot points with cluster dependent colors
plt.show()
(398, 2)
Out[ ]:
<matplotlib.collections.PathCollection at 0x7f22173b8790>

Part 2 DOMAIN: Manufacturing¶


CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.
DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.
Attribute Information:

  1. A, B, C, D: specific chemical composition measures of the wine
  2. Quality: quality of wine [ Low and High ]

PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.

Steps and tasks: [ Total Score: 5 points]

  1. Design a synthetic data generation model that can impute values for the Quality attribute wherever the company has missed recording the data
In [ ]:
# Import the Part 2 dataset and check shape
df_WineQuality = pd.read_excel("Part2 - Company.xlsx")
df_WineQuality.head(5)
df_WineQuality.shape
df_WineQuality["Quality"].unique()
Out[ ]:
A B C D Quality
0 47 27 45 108 Quality A
1 174 133 134 166 Quality B
2 159 163 135 131 NaN
3 61 23 3 44 Quality A
4 59 60 9 68 Quality A
Out[ ]:
(61, 5)
Out[ ]:
array(['Quality A', 'Quality B', nan], dtype=object)
In [ ]:
# Checking standard null values in data
df_WineQuality.isnull().sum()
Out[ ]:
A           0
B           0
C           0
D           0
Quality    18
dtype: int64
In [ ]:
# Apply label encoder (note: this also recodes the numeric columns A-D as ordinal ranks)
from sklearn.preprocessing import LabelEncoder
_encoder = LabelEncoder()
_cols = df_WineQuality.columns
df_WineQuality[_cols] = df_WineQuality[_cols].apply(_encoder.fit_transform)
df_WineQuality["Quality"].unique()
Out[ ]:
array([0, 1, 2])
In [ ]:
df_WineQuality.head()
Out[ ]:
A B C D Quality
0 10 5 10 24 0
1 41 26 25 39 1
2 34 39 26 27 2
3 13 4 0 11 0
4 12 12 3 17 0

The label encoder assigned the code 2 to the NaN values. Let us convert these back to NaN so that they can be imputed with KNN¶

In [ ]:
df_WineQuality["Quality"] = df_WineQuality["Quality"].replace(2,np.nan)
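A safer alternative to label-encoding a column that still contains NaN is pandas' categorical codes, which reserve -1 for missing values instead of assigning them a real class, avoiding the encode-then-replace step above (a sketch, not the notebook's approach):

```python
import numpy as np
import pandas as pd

quality = pd.Series(["Quality A", "Quality B", np.nan, "Quality A"])
codes = quality.astype("category").cat.codes

print(codes.tolist())  # NaN stays distinguishable as -1
```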
In [ ]:
# Checking standard null values in data
df_WineQuality.isnull().sum()
Out[ ]:
A           0
B           0
C           0
D           0
Quality    18
dtype: int64
In [ ]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2);
imputer.fit_transform(df_WineQuality);
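The mechanics can be seen end to end on a toy frame: KNN imputation averages the Quality codes of the nearest complete rows, so the result is rounded back to a discrete label (the data below is a hypothetical stand-in for the wine frame):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in: numeric features plus a 0/1-coded label with a gap
df = pd.DataFrame({"A": [1.0, 1.1, 9.0, 9.2, 1.05],
                   "B": [2.0, 2.1, 8.0, 8.1, 2.05],
                   "Quality": [0.0, 0.0, 1.0, 1.0, np.nan]})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# KNN averages neighbour labels, so round back to the discrete codes
imputed["Quality"] = imputed["Quality"].round().astype(int)
print(imputed["Quality"].tolist())
```

The last row sits next to the two Quality-0 rows, so its missing label is imputed as 0.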

Part 3 DOMAIN: Automobile¶


CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
DATA DESCRIPTION: The data contains features extracted from the silhouette of vehicles in different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, Chevrolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
All the features are numeric i.e. geometric features extracted from the silhouette.

PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.

Steps and tasks: [ Total Score: 20 points]

  1. Data: Import, clean and pre-process the data
  2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods. Use your best analytical approach to build this report. You can even mix and match columns to create new columns for better analysis, and create your own features if required. Be highly experimental and analytical here to find hidden patterns.
  3. Classifier: Design and train a best-fit SVM classifier using all the data attributes.
  4. Dimensionality reduction: Perform dimensionality reduction on the data.
  5. Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.
  6. Conclusion: Showcase key pointers on how dimensionality reduction helped in this case

Step 1: Data: Import, clean and pre-process the data

In [ ]:
# Import First dataset and check shape
df_Auto = pd.read_csv("Part3 - vehicle.csv")
df_Auto.sample(5)
df_Auto.shape
Out[ ]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
189 90 36.0 78.0 179.0 64.0 8 157.0 42.0 19.0 126 182.0 367.0 142.0 66.0 1.0 20.0 192.0 198 car
74 89 42.0 89.0 147.0 61.0 11 151.0 44.0 19.0 145 170.0 338.0 163.0 72.0 11.0 23.0 187.0 199 van
548 94 39.0 75.0 184.0 72.0 8 155.0 42.0 19.0 133 175.0 365.0 145.0 70.0 4.0 5.0 192.0 200 bus
822 95 41.0 82.0 170.0 65.0 9 145.0 46.0 19.0 145 163.0 314.0 140.0 64.0 4.0 8.0 199.0 207 van
235 90 48.0 78.0 134.0 56.0 11 160.0 43.0 20.0 167 169.0 366.0 185.0 76.0 1.0 14.0 182.0 192 van
Out[ ]:
(846, 19)
In [ ]:
# Check info of data
df_Auto.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   compactness                  846 non-null    int64  
 1   circularity                  841 non-null    float64
 2   distance_circularity         842 non-null    float64
 3   radius_ratio                 840 non-null    float64
 4   pr.axis_aspect_ratio         844 non-null    float64
 5   max.length_aspect_ratio      846 non-null    int64  
 6   scatter_ratio                845 non-null    float64
 7   elongatedness                845 non-null    float64
 8   pr.axis_rectangularity       843 non-null    float64
 9   max.length_rectangularity    846 non-null    int64  
 10  scaled_variance              843 non-null    float64
 11  scaled_variance.1            844 non-null    float64
 12  scaled_radius_of_gyration    844 non-null    float64
 13  scaled_radius_of_gyration.1  842 non-null    float64
 14  skewness_about               840 non-null    float64
 15  skewness_about.1             845 non-null    float64
 16  skewness_about.2             845 non-null    float64
 17  hollows_ratio                846 non-null    int64  
 18  class                        846 non-null    object 
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [ ]:
# Checking standard null values in data
df_Auto.isnull().sum()
Out[ ]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [ ]:
# Imputing null values with the column medians (numeric columns only; 'class' is a string)
df_Auto = df_Auto.fillna(df_Auto.median(numeric_only=True))

Step 2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods. Use your best analytical approach to build this report. You can even mix and match columns to create new columns for better analysis, and create your own features if required. Be highly experimental and analytical here to find hidden patterns.

a) Univariate and Bivariate Analysis

In [ ]:
# Describe the numerical data
df_Auto.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
In [ ]:
# Pairplot for univariate and Bivariate analysis
sns.pairplot(df_Auto, hue="class", diag_kind='kde');

Steps 3 and 4. SVM with and without dimensionality reduction: perform dimensionality reduction on the data.

In [ ]:
# Standardizing the data for both SVM and PCA
from scipy.stats import zscore
df_Auto_Scaled = df_Auto.drop('class', axis=1)
df_Auto_Scaled=df_Auto_Scaled.apply(zscore)
df_Auto_Scaled.head()
Out[ ]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.518073 0.057177 0.273363 1.310398 0.311542 -0.207598 0.136262 -0.224342 0.758332 -0.401920 -0.341934 0.285705 -0.327326 -0.073812 0.380870 -0.312012 0.183957
1 -0.325470 -0.623732 0.120741 -0.835032 -0.593753 0.094079 -0.599423 0.520519 -0.610886 -0.344578 -0.593357 -0.619724 -0.513630 -0.059384 0.538390 0.156798 0.013265 0.452977
2 1.254193 0.844303 1.519141 1.202018 0.548738 0.311542 1.148719 -1.144597 0.935290 0.689401 1.097671 1.109379 1.392477 0.074587 1.558727 -0.403383 -0.149374 0.049447
3 -0.082445 -0.623732 -0.006386 -0.295813 0.167907 0.094079 -0.750125 0.648605 -0.610886 -0.344578 -0.912419 -0.738777 -1.466683 -1.265121 -0.073812 -0.291347 1.639649 1.529056
4 -1.054545 -0.134387 -0.769150 1.082192 5.245643 9.444962 -0.599423 0.520519 -0.610886 -0.275646 1.671982 -0.648070 0.408680 7.309005 0.538390 -0.179311 -1.450481 -1.699181
In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(df_Auto_Scaled)
Out[ ]:
PCA(n_components=18)
In [ ]:
# Eigenvalues (variance explained by each component)
print(pca.explained_variance_)
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00
 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01
 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02
 3.57947189e-02 2.74120657e-02 2.05792871e-02 1.79166314e-02
 1.00257898e-02 2.96445743e-03]
In [ ]:
# Explained variance ratio of each principal component
print(pca.explained_variance_ratio_)
[5.21860337e-01 1.67297684e-01 1.05626388e-01 6.54745969e-02
 5.08986889e-02 2.99641300e-02 1.99136623e-02 1.23150069e-02
 8.91215289e-03 5.09714695e-03 3.69004485e-03 2.58586200e-03
 1.98624491e-03 1.52109243e-03 1.14194232e-03 9.94191854e-04
 5.56329946e-04 1.64497408e-04]
In [ ]:
plt.bar(list(range(1,19)), pca.explained_variance_ratio_, alpha=0.5, align='center');
plt.ylabel('Variation explained');
plt.xlabel('Principal component');
plt.show();
In [ ]:
plt.step(list(range(1,19)), np.cumsum(pca.explained_variance_ratio_), where='mid');
plt.ylabel('Cumulative variation explained');
plt.xlabel('Principal component');
plt.show();

Observation¶

It can be seen that 10 components are sufficient to explain about 98.7% of the variability in the data; hence 10 principal components are selected.
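Rather than reading the cumulative plot by eye, sklearn can pick the component count directly: passing a float in (0, 1) as n_components keeps the smallest number of components explaining at least that fraction of variance. A sketch on toy low-rank data (the shapes and seed are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 18 correlated features whose variance lives in ~5 directions
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 18)) + 0.01 * rng.normal(size=(200, 18))

pca = PCA(n_components=0.98)  # keep the fewest components covering >= 98% variance
Xr = pca.fit_transform(X)
print(Xr.shape[1], pca.explained_variance_ratio_.sum())
```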

Reducing dimensions to 10 using the above observation¶

In [ ]:
pca10 = PCA(n_components=10)
pca10.fit(df_Auto_Scaled)
Xpca10 = pca10.transform(df_Auto_Scaled)
Xpca10.shape
Out[ ]:
PCA(n_components=10)
Out[ ]:
(846, 10)

Training SVM model without PCA¶

In [ ]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, mean_absolute_error
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
In [ ]:
# Transform data into features and target
X_org = df_Auto_Scaled
y = df_Auto['class'].astype('category')
In [ ]:
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_org, y, test_size=0.2, random_state=7)
In [ ]:
autosvm = SVC(gamma=0.025, C=3)
autosvm.fit(X_train , y_train)
y_pred = autosvm.predict(X_test)
Out[ ]:
SVC(C=3, gamma=0.025)
In [ ]:
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
0.9764705882352941
In [ ]:
# Performing Grid search for Best hyperparameters

from sklearn.model_selection import GridSearchCV

# defining parameter range
param_grid = {'C': [0.1, 1, 10],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']};

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3);

# fitting the model for grid search
grid.fit(X_train, y_train);

# print best parameter after tuning
print(grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   0.1s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.860 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.867 total time=   0.0s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.815 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.837 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.807 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time=   0.0s
[CV 4/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=0.01, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 3/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 4/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 3/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 4/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END ...C=0.1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.779 total time=   0.0s
[CV 2/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.800 total time=   0.0s
[CV 3/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.815 total time=   0.0s
[CV 4/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.741 total time=   0.0s
[CV 5/5] END ..........C=1, gamma=1, kernel=rbf;, score=0.793 total time=   0.0s
[CV 1/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.985 total time=   0.0s
[CV 2/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.956 total time=   0.0s
[CV 3/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.963 total time=   0.0s
[CV 4/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.948 total time=   0.0s
[CV 5/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.956 total time=   0.0s
[CV 1/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.963 total time=   0.0s
[CV 2/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.941 total time=   0.0s
[CV 3/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.919 total time=   0.0s
[CV 4/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.933 total time=   0.0s
[CV 5/5] END .......C=1, gamma=0.01, kernel=rbf;, score=0.904 total time=   0.0s
[CV 1/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.519 total time=   0.0s
[CV 3/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.526 total time=   0.0s
[CV 4/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 3/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 4/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END .....C=1, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 1/5] END .........C=10, gamma=1, kernel=rbf;, score=0.809 total time=   0.0s
[CV 2/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time=   0.0s
[CV 3/5] END .........C=10, gamma=1, kernel=rbf;, score=0.815 total time=   0.0s
[CV 4/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time=   0.0s
[CV 5/5] END .........C=10, gamma=1, kernel=rbf;, score=0.800 total time=   0.0s
[CV 1/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.978 total time=   0.0s
[CV 2/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.978 total time=   0.0s
[CV 3/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.963 total time=   0.0s
[CV 4/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.956 total time=   0.0s
[CV 5/5] END .......C=10, gamma=0.1, kernel=rbf;, score=0.963 total time=   0.0s
[CV 1/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.985 total time=   0.0s
[CV 2/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.970 total time=   0.0s
[CV 3/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.956 total time=   0.0s
[CV 4/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.963 total time=   0.0s
[CV 5/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.956 total time=   0.0s
[CV 1/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.919 total time=   0.0s
[CV 2/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.933 total time=   0.0s
[CV 3/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.874 total time=   0.0s
[CV 4/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.911 total time=   0.0s
[CV 5/5] END .....C=10, gamma=0.001, kernel=rbf;, score=0.881 total time=   0.0s
[CV 1/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.507 total time=   0.0s
[CV 2/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.519 total time=   0.0s
[CV 3/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.526 total time=   0.0s
[CV 4/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
[CV 5/5] END ....C=10, gamma=0.0001, kernel=rbf;, score=0.511 total time=   0.0s
Out[ ]:
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=3)
{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
SVC(C=10, gamma=0.1)
In [ ]:
# Run the best model and print its accuracy
autosvm = SVC(gamma=0.1, C=10)
autosvm.fit(X_train , y_train)
y_pred = autosvm.predict(X_test)
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
Out[ ]:
SVC(C=10, gamma=0.1)
0.9823529411764705

Training SVM model with PCA¶

In [ ]:
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(Xpca10, y, test_size=0.2, random_state=7)
In [ ]:
autosvm_pca = SVC(gamma=0.1, C=10)
autosvm_pca.fit(X_train , y_train)
y_pred = autosvm_pca.predict(X_test)
Out[ ]:
SVC(C=10, gamma=0.1)
In [ ]:
# Evaluate accuracy
print(accuracy_score(y_test, y_pred))
0.9647058823529412

Observation¶

  1. The accuracy of the SVM without PCA is 98.23%, versus 96.47% with PCA. Even with a large reduction in dimensionality (18 -> 10), the accuracy drop is marginal (~2%).
  2. With fewer features, however, the computational efficiency of the model improves.
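The trade-off described above can be sketched on synthetic data; the notebook's own `X` and `y` are not reproduced here, so `make_classification` stands in for them and the exact scores are illustrative only:

```python
# Hypothetical sketch: accuracy with and without PCA before an SVM.
# make_classification stands in for the notebook's own data (18 features -> 10 PCs).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=18, n_informative=10,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# Fit on all 18 features
full = SVC(gamma=0.1, C=10).fit(X_tr, y_tr)
acc_full = accuracy_score(y_te, full.predict(X_te))

# Reduce to 10 principal components, then fit
pca = PCA(n_components=10).fit(X_tr)
reduced = SVC(gamma=0.1, C=10).fit(pca.transform(X_tr), y_tr)
acc_pca = accuracy_score(y_te, reduced.predict(pca.transform(X_te)))

print(acc_full, acc_pca)
```

With fewer input columns, both the kernel computation and prediction become cheaper, which is the efficiency gain noted above.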

Part 4 DOMAIN: Sports management¶


CONTEXT: Company X is a sports management company for international cricket.
DATA DESCRIPTION: The data belongs to batsmen from the IPL series conducted so far.
Attribute Information:

  1. Runs: runs scored by the batsman
  2. Ave: average runs scored by the batsman per match
  3. SR: strike rate of the batsman
  4. Fours: number of fours (boundaries) scored
  5. Sixes: number of sixes scored
  6. HF: number of half-centuries scored so far

PROJECT OBJECTIVE: Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.

Steps and tasks: [ Total Score: 5 points]

  1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.
  2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.
In [ ]:
# Import the batsman dataset and check shape
df_BatsmRank = pd.read_csv("Part4 - batting_bowling_ipl_bat.csv")
df_BatsmRank.sample(5)
df_BatsmRank.shape
Out[ ]:
Name Runs Ave SR Fours Sixes HF
144 NaN NaN NaN NaN NaN NaN NaN
143 MJ Clarke 98.0 16.33 104.25 12.0 0.0 0.0
83 IK Pathan 176.0 25.14 139.68 14.0 6.0 0.0
5 V Sehwag 495.0 33.00 161.23 57.0 19.0 5.0
104 NaN NaN NaN NaN NaN NaN NaN
Out[ ]:
(180, 7)
In [ ]:
# Check the data types
df_BatsmRank.dtypes
Out[ ]:
Name      object
Runs     float64
Ave      float64
SR       float64
Fours    float64
Sixes    float64
HF       float64
dtype: object
In [ ]:
# Check for null values in the data
df_BatsmRank.isnull()
Out[ ]:
Name Runs Ave SR Fours Sixes HF
0 True True True True True True True
1 False False False False False False False
2 True True True True True True True
3 False False False False False False False
4 True True True True True True True
... ... ... ... ... ... ... ...
175 False False False False False False False
176 True True True True True True True
177 False False False False False False False
178 True True True True True True True
179 False False False False False False False

180 rows × 7 columns
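The boolean frame above suggests that every other row is entirely empty. Rather than scanning it by eye, the missingness can be summarised directly; a toy frame mimicking the alternating all-NaN rows is used here instead of the real CSV:

```python
# Hedged sketch: summarise missing values instead of printing the boolean frame.
# The toy frame mimics the alternating all-NaN rows seen in the real data.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["CH Gayle", np.nan, "G Gambhir", np.nan],
                   "Runs": [733.0, np.nan, 590.0, np.nan]})

# Count missing values per column
null_counts = df.isnull().sum()

# Count rows that are missing in every column (candidates for dropna)
empty_rows = df.isnull().all(axis=1).sum()

print(null_counts)
print(empty_rows)
```

`dropna()` in the next cell removes exactly these fully-empty rows.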

In [ ]:
df_BatsmRank_pca = df_BatsmRank.dropna()
Names = df_BatsmRank_pca["Name"]
Names.reset_index(inplace = True, drop = True)
df_BatsmRank_pca.head(5)
Out[ ]:
Name Runs Ave SR Fours Sixes HF
1 CH Gayle 733.0 61.08 160.74 46.0 59.0 9.0
3 G Gambhir 590.0 36.87 143.55 64.0 17.0 6.0
5 V Sehwag 495.0 33.00 161.23 57.0 19.0 5.0
7 CL White 479.0 43.54 149.68 41.0 20.0 5.0
9 S Dhawan 569.0 40.64 129.61 58.0 18.0 5.0
In [ ]:
from scipy.stats import zscore
df_BatsmRank_pca = df_BatsmRank_pca.drop('Name', axis=1)
df_BatsmRank_pca=df_BatsmRank_pca.apply(zscore)
df_BatsmRank_pca.head()
Out[ ]:
Runs Ave SR Fours Sixes HF
1 3.301945 2.683984 1.767325 1.607207 6.462679 4.651551
3 2.381639 0.896390 1.036605 2.710928 1.184173 2.865038
5 1.770248 0.610640 1.788154 2.281703 1.435530 2.269533
7 1.667276 1.388883 1.297182 1.300618 1.561209 2.269533
9 2.246490 1.174755 0.444038 2.343021 1.309851 2.269533
In [ ]:
covMatrix = np.cov(df_BatsmRank_pca,rowvar=False)
print(covMatrix)
[[1.01123596 0.70077082 0.49903347 0.9291323  0.77842677 0.84453142]
 [0.70077082 1.01123596 0.63061271 0.55234856 0.69008186 0.62772842]
 [0.49903347 0.63061271 1.01123596 0.38913406 0.59050396 0.43238784]
 [0.9291323  0.55234856 0.38913406 1.01123596 0.52844526 0.79249429]
 [0.77842677 0.69008186 0.59050396 0.52844526 1.01123596 0.77632221]
 [0.84453142 0.62772842 0.43238784 0.79249429 0.77632221 1.01123596]]
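A small aside on the matrix above: since the data was z-scored, this covariance matrix is essentially the correlation matrix, and the diagonal entries come out as 1.0112 rather than exactly 1 because `scipy.stats.zscore` standardises with the population std (ddof=0) while `np.cov` uses the sample covariance (ddof=1), giving each variance as n/(n-1) = 90/89:

```python
# Why the diagonal above is 1.0112 rather than 1: zscore uses ddof=0,
# np.cov uses ddof=1, so each variance is n/(n-1) after standardisation.
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
z = zscore(rng.normal(size=(90, 3)))     # 90 rows, like the batsman data
cov = np.cov(z, rowvar=False)

print(np.diag(cov))                      # each entry equals 90/89
print(np.allclose(np.diag(cov), 90 / 89))
```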
In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=6)
pca.fit(df_BatsmRank_pca)
Out[ ]:
PCA(n_components=6)
In [ ]:
# Eigen values
print(pca.explained_variance_)
[4.30252561 0.83636692 0.41665751 0.32912443 0.16567829 0.01706297]
In [ ]:
# Eigen vectors
print(pca.components_)
[[ 0.4582608   0.39797313  0.3253838   0.40574167  0.41733459  0.43237178]
 [ 0.26643209 -0.33111756 -0.69780334  0.47355804 -0.17902455  0.27593225]
 [-0.10977942  0.00550486 -0.45013448 -0.50823538  0.66942589  0.28082541]
 [-0.00520142  0.84736307 -0.43275029 -0.03252305 -0.24878157 -0.17811777]
 [ 0.45840889 -0.10122837 -0.11890348  0.09676885  0.39458014 -0.77486668]
 [ 0.70483594 -0.0606373   0.05624934 -0.58514214 -0.35786211  0.16096217]]
In [ ]:
# Explained variance ratio of each component
print(pca.explained_variance_ratio_)
[0.70911996 0.13784566 0.06867133 0.05424458 0.02730624 0.00281223]
In [ ]:
plt.bar(list(range(1,7)), pca.explained_variance_ratio_, alpha=0.5, align='center');
plt.ylabel('Variation explained');
plt.xlabel('Principal component');
plt.show();
In [ ]:
plt.step(list(range(1,7)), np.cumsum(pca.explained_variance_ratio_), where='mid');
plt.ylabel('Cumulative variation explained');
plt.xlabel('Principal component');
plt.show();

Four dimensions seem reasonable, since they explain more than 95% of the variability.¶
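As an alternative to reading the cumulative plot, scikit-learn's `PCA` also accepts a float `n_components`, keeping just enough components to reach that share of variance. A sketch on correlated toy data (standing in for the z-scored batsman frame):

```python
# Alternative to eyeballing the cumulative plot: let PCA pick the number
# of components needed to reach 95% explained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Correlated toy data: 6 columns, most variance shared by the first four
base = rng.normal(size=(90, 2))
X = np.hstack([base,
               base + 0.1 * rng.normal(size=(90, 2)),
               0.05 * rng.normal(size=(90, 2))])

pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)                        # components kept
print(pca.explained_variance_ratio_.sum())      # at least 0.95
```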

Reducing dimensionality to four principal components¶

In [ ]:
pca4 = PCA(n_components=4)
pca4.fit(df_BatsmRank_pca)
Xpca4 = pca4.transform(df_BatsmRank_pca)
Xpca4.shape
Out[ ]:
PCA(n_components=4)
Out[ ]:
(90, 4)
In [ ]:
# Eigenvalues and explained-variance ratios (note: 'Vectors' here holds the variance ratios, not the eigenvectors)
Values = pd.DataFrame(pca4.explained_variance_)
Vectors = pd.DataFrame(pca4.explained_variance_ratio_)
In [ ]:
# Weight for each principal component (eigenvalue * explained-variance ratio)
Score = Values*(Vectors)
print(Score)
Score.shape
          0
0  3.051007
1  0.115290
2  0.028612
3  0.017853
Out[ ]:
(4, 1)
In [ ]:
Final_Score = pd.DataFrame(Xpca4.dot(Score));
In [ ]:
NameScore = Final_Score.join(Names)
NameScore.head()
Out[ ]:
0 Name
0 26.031149 CH Gayle
1 14.235813 G Gambhir
2 12.656784 V Sehwag
3 11.905379 CL White
4 12.728287 S Dhawan
In [ ]:
# Sort players with highest ranking first
NameScore.sort_values(0,axis=0,ascending=False)
Out[ ]:
0 Name
0 26.031149 CH Gayle
1 14.235813 G Gambhir
4 12.728287 S Dhawan
2 12.656784 V Sehwag
5 12.479462 AM Rahane
... ... ...
86 -9.010853 WD Parnell
85 -9.035490 Z Khan
87 -9.169080 PC Valthaty
88 -10.212834 RP Singh
89 -11.675565 R Sharma

90 rows × 2 columns
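The ranking recipe above can be restated compactly on toy data: z-score the stats, project onto the top principal components, weight each projection by eigenvalue × explained-variance ratio, and sum into a single score per player. Column names are reused from the dataset; the numbers are synthetic:

```python
# Compact restatement of the ranking pipeline above, on synthetic stats.
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
stats = pd.DataFrame(rng.normal(size=(20, 6)),
                     columns=["Runs", "Ave", "SR", "Fours", "Sixes", "HF"])
z = stats.apply(zscore)

pca = PCA(n_components=4).fit(z)
weights = pca.explained_variance_ * pca.explained_variance_ratio_
scores = pca.transform(z) @ weights      # one composite score per player

ranking = pd.Series(scores, name="score").sort_values(ascending=False)
print(ranking.head())
```

Because the first component dominates the weights, the ranking is driven mostly by overall performance along that component, which is why all-round heavy scorers like CH Gayle land at the top in the real output.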

Part 5¶


Question: List down all possible dimensionality reduction techniques that can be implemented using python.
Answer:


Feature Projection based:

  1. Principal Component Analysis (PCA) or Kernel PCA
  2. Singular Value Decomposition (SVD)
  3. Linear Discriminant Analysis
  4. t-distributed Stochastic Neighbor Embedding (t-SNE)
  5. Autoencoders
  6. Uniform Manifold Approximation and Projection (UMAP)


Feature selection based:

  1. Lasso regression
  2. Filter methods based on correlation/statistics, Factor Analysis, low-variance filter
  3. Decision-tree-based feature selection
  4. Random Forest feature importance
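Item 1 can be sketched with scikit-learn's `SelectFromModel`: Lasso drives weak coefficients to exactly zero, and the selector keeps only the surviving features. The dataset and the `alpha` value here are illustrative:

```python
# Sketch of lasso-based feature selection: the L1 penalty zeroes out
# uninformative coefficients, and SelectFromModel keeps the rest.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)     # fewer than the original 10 columns
```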

So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings using a simple implementation in python
Answer: Yes, it is possible to apply dimensionality reduction techniques to multimedia and text data:

  1. SVD for natural language processing (latent semantic analysis)
  2. Latent Dirichlet Allocation (LDA)
  3. Using pre-trained embeddings such as word2vec
  4. Term Frequency-Inverse Document Frequency (TF-IDF), HashingVectorizer
  5. PCA, Factor Analysis
  6. Clustering text documents using k-means with a bag-of-words approach
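Items 1 and 4 can be combined into a minimal text example: vectorise a tiny toy corpus with TF-IDF, then reduce it with truncated SVD (latent semantic analysis). The corpus below is invented for illustration:

```python
# Sketch: TF-IDF vectorisation followed by truncated SVD (LSA)
# reduces each document from a sparse vocabulary vector to 2 dimensions.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the batsman scored runs",
          "runs and fours by the batsman",
          "fuel consumption of the car",
          "car mileage in miles per gallon"]

tfidf = TfidfVectorizer().fit_transform(corpus)   # docs x vocabulary
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(tfidf)                # docs x 2

print(tfidf.shape, "->", reduced.shape)
```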
In [ ]:
# Consider an image dataset of 8x8 digit images
from sklearn.datasets import load_digits

# Import the factor analysis module
from sklearn.decomposition import FactorAnalysis

# Load the digits data and check its shape
X, _ = load_digits(return_X_y=True)
print("Original size", X.shape)

# Use factor analysis to reduce the dimensionality
transformer = FactorAnalysis(n_components=7, random_state=0)
X_transformed = transformer.fit_transform(X)
print("Transformed size", X_transformed.shape)
Original size (1797, 64)
Transformed size (1797, 7)
In [3]:
!jupyter nbconvert --to='html' '/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Unsupervised_Learning.ipynb'
[NbConvertApp] WARNING | pattern '/content/drive/MyDrive/AI_ML_course/Unsupervised_Learning/Unsupervised_Learning.ipynb' matched no files